Summary
This workflow demonstrates the mitosRNAseq pipeline which identifies several types of small RNA from sequencing data from short reads of <50 nt. The following method collects sequence-based counts, sequence locations, genomic features, and annotations. The data we use is from killifish embyros anoxia experiments.
The step-by-step workflow we follow is illistrated below and a detailed method to perform small RNA seq on an academic HPC is decsribed in this document.

The following table shows the contents of the file samples.csv that contains metadata for our samples. We start with a total of 40 fastq files.
Sample Table
|
#
|
Sample
|
Condition
|
Stage
|
Input files
|
|
1
|
31_4d_t0
|
4d_t0
|
4
|
31_4d_t0.fastq.gz
|
|
2
|
32_4d_t0
|
4d_t0
|
4
|
32_4d_t0.fastq.gz
|
|
3
|
33_4d_t0
|
4d_t0
|
4
|
33_4d_t0.fastq.gz
|
|
4
|
34_4d_t0
|
4d_t0
|
4
|
34_4d_t0.fastq.gz
|
|
5
|
35_4d_4hrA
|
4d_4hrA
|
4
|
35_4d_4hrA.fastq.gz
|
|
6
|
36_4d_4hrA
|
4d_4hrA
|
4
|
36_4d_4hrA.fastq.gz
|
|
7
|
37_4d_4hrA
|
4d_4hrA
|
4
|
37_4d_4hrA.fastq.gz
|
|
8
|
38_4d_4hrA
|
4d_4hrA
|
4
|
38_4d_4hrA.fastq.gz
|
|
9
|
39_4d_24hrA
|
4d_24hrA
|
4
|
39_4d_24hrA.fastq.gz
|
|
10
|
40_4d_24hrA
|
4d_24hrA
|
4
|
40_4d_24hrA.fastq.gz
|
|
11
|
41_4d_24hrA
|
4d_24hrA
|
4
|
41_4d_24hrA.fastq.gz
|
|
12
|
42_4d_24hrA
|
4d_24hrA
|
4
|
42_4d_24hrA.fastq.gz
|
|
13
|
43_4d_2hrR
|
4d_2hrR
|
4
|
43_4d_2hrR.fastq.gz
|
|
14
|
44_4d_2hrR
|
4d_2hrR
|
4
|
44_4d_2hrR.fastq.gz
|
|
15
|
45_4d_2hrR
|
4d_2hrR
|
4
|
45_4d_2hrR.fastq.gz
|
|
16
|
46_4d_2hrR
|
4d_2hrR
|
4
|
46_4d_2hrR.fastq.gz
|
|
17
|
47_4d_24hrR
|
4d_24hrR
|
4
|
47_4d_24hrR.fastq.gz
|
|
18
|
48_4d_24hrR
|
4d_24hrR
|
4
|
48_4d_24hrR.fastq.gz
|
|
19
|
49_4d_24hrR
|
4d_24hrR
|
4
|
49_4d_24hrR.fastq.gz
|
|
20
|
50_4d_24hrR
|
4d_24hrR
|
4
|
50_4d_24hrR.fastq.gz
|
|
21
|
1_12d_t0
|
12d_t0
|
12
|
1_12d_t0.fastq.gz
|
|
22
|
2_12d_t0
|
12d_t0
|
12
|
2_12d_t0.fastq.gz
|
|
23
|
5_12d_t0
|
12d_t0
|
12
|
5_12d_t0.fastq.gz
|
|
24
|
6_12d_t0
|
12d_t0
|
12
|
6_12d_t0.fastq.gz
|
|
25
|
10_12d_4hrA
|
12d_4hrA
|
12
|
10_12d_4hrA.fastq.gz
|
|
26
|
11_12d_4hrA
|
12d_4hrA
|
12
|
11_12d_4hrA.fastq.gz
|
|
27
|
12_12d_4hrA
|
12d_4hrA
|
12
|
12_12d_4hrA.fastq.gz
|
|
28
|
7_12d_4hrA
|
12d_4hrA
|
12
|
7_12d_4hrA.fastq.gz
|
|
29
|
13_12d_24hrA
|
12d_24hrA
|
12
|
13_12d_24hrA.fastq.gz
|
|
30
|
16_12d_24hrA
|
12d_24hrA
|
12
|
16_12d_24hrA.fastq.gz
|
|
31
|
17_12d_24hrA
|
12d_24hrA
|
12
|
17_12d_24hrA.fastq.gz
|
|
32
|
18_12d_24hrA
|
12d_24hrA
|
12
|
18_12d_24hrA.fastq.gz
|
|
33
|
19_12d_2hrR
|
12d_2hrR
|
12
|
19_12d_2hrR.fastq.gz
|
|
34
|
21_12d_2hrR
|
12d_2hrR
|
12
|
21_12d_2hrR.fastq.gz
|
|
35
|
22_12d_2hrR
|
12d_2hrR
|
12
|
22_12d_2hrR.fastq.gz
|
|
36
|
23_12d_2hrR
|
12d_2hrR
|
12
|
23_12d_2hrR.fastq.gz
|
|
37
|
25_12d_24hrR
|
12d_24hrR
|
12
|
25_12d_24hrR.fastq.gz
|
|
38
|
28_12d_24hrR
|
12d_24hrR
|
12
|
28_12d_24hrR.fastq.gz
|
|
39
|
29_12d_24hrR
|
12d_24hrR
|
12
|
29_12d_24hrR.fastq.gz
|
|
40
|
30_12d_24hrR
|
12d_24hrR
|
12
|
30_12d_24hrR.fastq.gz
|
QC for raw reads
Quality Control (QC) is done by running fastQC on raw reads to evaluate how good (or bad) our data looks.
Install software for Quality Check
conda
Set up a virtual environment in conda that has all of the Quality Control (QC) tools. We are using Miniconda. For creating sharable environments, create .yaml files for each environment you’re making.
conda create -n condafastqc fastqc
conda activate condafastqc
conda install -n multiqc
tmux
For almost every long-ish process, we use tmux to run tasks in a detached terminal window that can keep running if we want to log out or if we accidently log out.
tmux new -s fastqc
Executing
Creating a new file called fastqcr.R that has the script to run fastqc through R. We use this script to automate the process for a list of files in a directory. Running commands through various scripts also keeps a record of our commands.
if (!require("fastqcr")) {
install.packages("fastqcr", repos="http://cran.us.r-project.org")
library(fastqcr)
} else {library(fastqcr)}
fastqc(fq.dir="/path/to/directory/init_fastq_files",
qc.dir="/path/to/directory/fastqc/init_fastqc",
threads=20)
Running the script with the following command:
Rscript fastqcr.R
Press Ctrl+B and then D for detaching this tmux window. You can log out from your remote computer (HPC) and close your terminal window now while this process keeps running on your remote computer. To get back to this window, log in to your remote computer, and use the command tmux a -t fastqc. F or more information, check tmux.
We will get a fastqc HTML file in the output fastqc directory for each fastq sequence in the input init_fastq_files directory.
FastQC Output
General Statistics from multiqc
| Sample |
PercentDups |
PercentGC |
BPMedianReadLength |
MillionsSeqs |
| 10_12d_4hrA |
94.1 |
51 |
36 |
12.6 |
| 11_12d_4hrA |
92.7 |
51 |
36 |
16.2 |
| 12_12d_4hrA |
87.4 |
52 |
36 |
9.0 |
| 13_12d_24hrA |
90.5 |
53 |
36 |
8.5 |
| 16_12d_24hrA |
91.1 |
50 |
36 |
7.9 |
| 17_12d_24hrA |
92.4 |
50 |
36 |
29.9 |
| 18_12d_24hrA |
84.6 |
52 |
36 |
11.8 |
| 19_12d_2hrR |
87.6 |
52 |
36 |
11.6 |
| 1_12d_t0 |
86.4 |
51 |
36 |
8.7 |
| 21_12d_2hrR |
95.4 |
54 |
36 |
9.8 |
| 22_12d_2hrR |
80.5 |
52 |
36 |
6.7 |
| 23_12d_2hrR |
89.3 |
52 |
36 |
7.8 |
| 25_12d_24hrR |
88.1 |
53 |
36 |
13.5 |
| 28_12d_24hrR |
91.1 |
55 |
36 |
5.9 |
| 29_12d_24hrR |
91.4 |
54 |
36 |
8.3 |
| 2_12d_t0 |
91.5 |
54 |
36 |
8.3 |
| 30_12d_24hrR |
79.9 |
53 |
36 |
12.8 |
| 31_4d_t0 |
79.3 |
52 |
50 |
11.9 |
| 32_4d_t0 |
80.1 |
53 |
50 |
11.6 |
| 33_4d_t0 |
86.3 |
51 |
50 |
13.7 |
| 34_4d_t0 |
82.9 |
52 |
50 |
15.2 |
| 35_4d_4hrA |
88.4 |
52 |
50 |
11.0 |
| 36_4d_4hrA |
87.0 |
51 |
50 |
13.8 |
| 37_4d_4hrA |
84.7 |
52 |
50 |
15.6 |
| 38_4d_4hrA |
84.4 |
51 |
50 |
15.3 |
| 39_4d_24hrA |
79.7 |
53 |
50 |
12.4 |
| 40_4d_24hrA |
74.4 |
51 |
50 |
12.9 |
| 41_4d_24hrA |
79.3 |
52 |
50 |
9.7 |
| 42_4d_24hrA |
85.6 |
53 |
50 |
16.7 |
| 43_4d_2hrR |
88.3 |
52 |
50 |
13.0 |
| 44_4d_2hrR |
81.5 |
44 |
101 |
11.0 |
| 45_4d_2hrR |
87.9 |
44 |
101 |
13.8 |
| 46_4d_2hrR |
88.9 |
52 |
50 |
16.9 |
| 47_4d_24hrR |
72.8 |
45 |
101 |
14.6 |
| 48_4d_24hrR |
83.2 |
53 |
50 |
13.4 |
| 49_4d_24hrR |
87.1 |
51 |
50 |
22.8 |
| 50_4d_24hrR |
88.9 |
41 |
101 |
14.8 |
| 5_12d_t0 |
81.3 |
53 |
36 |
11.2 |
| 6_12d_t0 |
81.7 |
53 |
36 |
13.6 |
| 7_12d_4hrA |
81.4 |
52 |
36 |
10.5 |
Sequence Counts for each sample
The mean quality scores per base in the read
The average quality scores per sequence
The GC content per sequence
Sequence Duplication
Overrepresented sequences
Adapter Content
Overall Pass/Fail
Sequence length distribution
All samples had sequences of a single length (50bp , 36bp , 101bp).
Trimming
Trimming is done with Trimmomatic in our case, however, other tools like cutadapt or fastp may also be used. This tool runs through the following bash script
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/copied_old_infiles/init_fastq_files/*.fastq.gz"
for f in $FILES
do
b=`basename $f`
echo Running trimming for the file $b
c=${b::-9}
o="$c.trim.fastq"
log="$c.out.log"
c="$c.log.txt"
java -jar /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/trim/Trimmomatic-0.39/trimmomatic-0.39.jar SE -threads 20 -phred33 -trimlog $c $f $o ILLUMINACLIP:/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/copied_old_infiles/adapter_contamination_sequences_AR3.txt:2:30:5:1:true SLIDINGWINDOW:5:15 LEADING:20 TRAILING:20 MINLEN:15 2> $log
done
| Sample |
TotalReads |
SurvivingReads |
PercentSurviving |
DroppedReads |
PercentDropped |
| 10_12d_4hrA |
12625447 |
4421375 |
35.0 |
8204072 |
65.0 |
| 11_12d_4hrA |
16238016 |
8893201 |
54.8 |
7344815 |
45.2 |
| 12_12d_4hrA |
9024387 |
7219072 |
80.0 |
1805315 |
20.0 |
| 13_12d_24hrA |
8467459 |
5182199 |
61.2 |
3285260 |
38.8 |
| 16_12d_24hrA |
7908381 |
6458185 |
81.7 |
1450196 |
18.3 |
| 17_12d_24hrA |
29872147 |
17868683 |
59.8 |
12003464 |
40.2 |
| 18_12d_24hrA |
11793810 |
8182929 |
69.4 |
3610881 |
30.6 |
| 19_12d_2hrR |
11606761 |
9574385 |
82.5 |
2032376 |
17.5 |
| 1_12d_t0 |
8698124 |
6498924 |
74.7 |
2199200 |
25.3 |
| 21_12d_2hrR |
9784686 |
3585294 |
36.6 |
6199392 |
63.4 |
| 22_12d_2hrR |
6741253 |
4728213 |
70.1 |
2013040 |
29.9 |
| 23_12d_2hrR |
7836111 |
4455897 |
56.9 |
3380214 |
43.1 |
| 25_12d_24hrR |
13534542 |
8957294 |
66.2 |
4577248 |
33.8 |
| 28_12d_24hrR |
5866632 |
2402889 |
41.0 |
3463743 |
59.0 |
| 29_12d_24hrR |
8305061 |
2923930 |
35.2 |
5381131 |
64.8 |
| 2_12d_t0 |
8344087 |
6946381 |
83.2 |
1397706 |
16.8 |
| 30_12d_24hrR |
12811149 |
8785705 |
68.6 |
4025444 |
31.4 |
| 31_4d_t0 |
11936271 |
10306761 |
86.3 |
1629510 |
13.7 |
| 32_4d_t0 |
11595360 |
7974434 |
68.8 |
3620926 |
31.2 |
| 33_4d_t0 |
13655749 |
10422782 |
76.3 |
3232967 |
23.7 |
| 34_4d_t0 |
15201813 |
11186661 |
73.6 |
4015152 |
26.4 |
| 35_4d_4hrA |
11045591 |
7821501 |
70.8 |
3224090 |
29.2 |
| 36_4d_4hrA |
13840101 |
9383343 |
67.8 |
4456758 |
32.2 |
| 37_4d_4hrA |
15579833 |
12373672 |
79.4 |
3206161 |
20.6 |
| 38_4d_4hrA |
15286687 |
12142640 |
79.4 |
3144047 |
20.6 |
| 39_4d_24hrA |
12442304 |
9755623 |
78.4 |
2686681 |
21.6 |
| 40_4d_24hrA |
12866355 |
11130799 |
86.5 |
1735556 |
13.5 |
| 41_4d_24hrA |
9720710 |
8032689 |
82.6 |
1688021 |
17.4 |
| 42_4d_24hrA |
16698310 |
12979297 |
77.7 |
3719013 |
22.3 |
| 43_4d_2hrR |
12989273 |
9158318 |
70.5 |
3830955 |
29.5 |
| 44_4d_2hrR |
10968948 |
8421811 |
76.8 |
2547137 |
23.2 |
| 45_4d_2hrR |
13812376 |
6527219 |
47.3 |
7285157 |
52.7 |
| 46_4d_2hrR |
16946884 |
12011097 |
70.9 |
4935787 |
29.1 |
| 47_4d_24hrR |
14627707 |
12933658 |
88.4 |
1694049 |
11.6 |
| 48_4d_24hrR |
13383720 |
10269804 |
76.7 |
3113916 |
23.3 |
| 49_4d_24hrR |
22803428 |
18172334 |
79.7 |
4631094 |
20.3 |
| 50_4d_24hrR |
14759393 |
7495120 |
50.8 |
7264273 |
49.2 |
| 5_12d_t0 |
11217968 |
7594023 |
67.7 |
3623945 |
32.3 |
| 6_12d_t0 |
13607922 |
9592402 |
70.5 |
4015520 |
29.5 |
| 7_12d_4hrA |
10489257 |
6515199 |
62.1 |
3974058 |
37.9 |
QC for trimmed reads
Quality Control (QC) is done for the trimmed reads with fastQC. We activate the conda environment condafastqc for this process.
if (!require("fastqcr")) {
install.packages("fastqcr", repos="http://cran.us.r-project.org")
library(fastqcr)
} else {library(fastqcr)}
fastqc(fq.dir="/path/to/directory/trimmed_fastq_files",
qc.dir="/path/to/directory/fastqc/trim_fastqc",
threads=20)
Running the script with the following command:
Rscript fastqcr.R
FastQC Output
Plots for each FASTQC metric after trimming is shown below. All the adapters were removed and overall quality looks better.
General Statistics from multiqc
| Sample |
PercentDups |
PercentGC |
BPMedianReadLength |
MillionsSeqs |
| 10_12d_4hrA |
89.4 |
50 |
22 |
4.4 |
| 11_12d_4hrA |
90.3 |
49 |
22 |
8.9 |
| 12_12d_4hrA |
88.7 |
51 |
22 |
7.2 |
| 13_12d_24hrA |
88.4 |
50 |
22 |
5.2 |
| 16_12d_24hrA |
91.9 |
48 |
22 |
6.5 |
| 17_12d_24hrA |
91.2 |
48 |
22 |
17.9 |
| 18_12d_24hrA |
84.9 |
50 |
22 |
8.2 |
| 19_12d_2hrR |
88.5 |
51 |
22 |
9.6 |
| 1_12d_t0 |
86.7 |
49 |
22 |
6.5 |
| 21_12d_2hrR |
90.5 |
47 |
22 |
3.6 |
| 22_12d_2hrR |
80.9 |
50 |
22 |
4.7 |
| 23_12d_2hrR |
86.9 |
49 |
22 |
4.5 |
| 25_12d_24hrR |
86.6 |
50 |
22 |
9.0 |
| 28_12d_24hrR |
84.3 |
50 |
22 |
2.4 |
| 29_12d_24hrR |
83.9 |
50 |
22 |
2.9 |
| 2_12d_t0 |
93.8 |
53 |
22 |
6.9 |
| 30_12d_24hrR |
78.7 |
51 |
22 |
8.8 |
| 31_4d_t0 |
80.8 |
49 |
22 |
10.3 |
| 32_4d_t0 |
77.1 |
50 |
22 |
8.0 |
| 33_4d_t0 |
86.5 |
48 |
22 |
10.4 |
| 34_4d_t0 |
81.7 |
51 |
22 |
11.2 |
| 35_4d_4hrA |
86.5 |
50 |
22 |
7.8 |
| 36_4d_4hrA |
82.7 |
49 |
22 |
9.4 |
| 37_4d_4hrA |
84.8 |
51 |
22 |
12.4 |
| 38_4d_4hrA |
84.2 |
48 |
22 |
12.1 |
| 39_4d_24hrA |
79.1 |
51 |
23 |
9.8 |
| 40_4d_24hrA |
73.6 |
50 |
27 |
11.1 |
| 41_4d_24hrA |
79.0 |
49 |
22 |
8.0 |
| 42_4d_24hrA |
85.3 |
50 |
22 |
13.0 |
| 43_4d_2hrR |
86.9 |
50 |
22 |
9.2 |
| 44_4d_2hrR |
82.1 |
50 |
20 |
8.4 |
| 45_4d_2hrR |
80.0 |
50 |
20 |
6.5 |
| 46_4d_2hrR |
88.4 |
48 |
22 |
12.0 |
| 47_4d_24hrR |
73.3 |
48 |
26 |
12.9 |
| 48_4d_24hrR |
82.6 |
52 |
22 |
10.3 |
| 49_4d_24hrR |
87.9 |
49 |
22 |
18.2 |
| 50_4d_24hrR |
84.4 |
49 |
20 |
7.5 |
| 5_12d_t0 |
81.2 |
50 |
22 |
7.6 |
| 6_12d_t0 |
82.7 |
51 |
22 |
9.6 |
| 7_12d_4hrA |
79.5 |
50 |
22 |
6.5 |
Sequence Counts for each sample
The mean quality scores per base in the read
The average quality scores per sequence
The GC content per sequence
Sequence Duplication
Overrepresented sequences
Overall Pass/Fail
Sequence length distribution
Alignment
The trimmed reads are aligned to the reference genome using bowtie. To analyze reads mapping with mitochondrial genome, we perform the alignment in two steps.
- Complete Genome (with mitochondrial genome)- with algenome.fa
- Only the Mitochondrial genome which is extracted from genome.fa- almitogenome.fa
To begin alignment, we make bowtie indexes, this is done in two sets.
bowtie-index algenome.fa
bowtie-index almitogenome.fa
Alignment with genome
The alignment was done on the trimmed reads from previous section.
#!/bin/bash
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/trim/*.fastq"
for f in $FILES
do
b=`basename $f`
echo Running trimming for the file $b
t=${b::-6}
o="$t.sam"
uo="$t.mapped.fq"
echo Output file is $o
bowtie \
-p 20 \
-t \
-k 5 \
--best \
--strata \
-e 99999 \
-v 0 \
-l 15 \
--chunkmbs 2048 \
-x /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/bowtie_index_alim/algenome \
-q $f \
--al $uo \
--sam --no-unal > $o 2> $t.bowtie.log
done
#done
The following stats show alignment QC.
This table shows the number of reads aligned in each sample (MillionAlignedReads). It also shows the number of aligments of various reads that includes multimapped reads (MillionMappedReads).
|
Sample
|
PercentAligned
|
MillionAlignedReads
|
MillionMappedReads
|
|
10_12d_4hrA
|
73.2
|
3.2
|
8.0
|
|
11_12d_4hrA
|
73.6
|
6.5
|
15.4
|
|
12_12d_4hrA
|
70.4
|
5.1
|
12.4
|
|
13_12d_24hrA
|
69.6
|
3.6
|
8.7
|
|
16_12d_24hrA
|
77.5
|
5.0
|
10.3
|
|
17_12d_24hrA
|
75.8
|
13.5
|
31.4
|
|
18_12d_24hrA
|
73.7
|
6.0
|
15.0
|
|
19_12d_2hrR
|
71.9
|
6.9
|
16.8
|
|
1_12d_t0
|
73.3
|
4.8
|
11.1
|
|
21_12d_2hrR
|
77.8
|
2.8
|
5.3
|
|
22_12d_2hrR
|
66.2
|
3.1
|
7.4
|
|
23_12d_2hrR
|
73.2
|
3.3
|
7.8
|
|
25_12d_24hrR
|
69.7
|
6.2
|
15.0
|
|
28_12d_24hrR
|
65.3
|
1.6
|
3.7
|
|
29_12d_24hrR
|
70.9
|
2.1
|
5.0
|
|
2_12d_t0
|
76.9
|
5.3
|
13.1
|
|
30_12d_24hrR
|
65.5
|
5.8
|
13.8
|
|
31_4d_t0
|
62.0
|
6.4
|
11.8
|
|
32_4d_t0
|
73.5
|
5.9
|
12.2
|
|
33_4d_t0
|
80.6
|
8.4
|
15.0
|
|
34_4d_t0
|
70.5
|
7.9
|
15.8
|
|
35_4d_4hrA
|
71.5
|
5.6
|
9.2
|
|
36_4d_4hrA
|
62.1
|
5.8
|
8.3
|
|
37_4d_4hrA
|
69.1
|
8.6
|
16.4
|
|
38_4d_4hrA
|
76.1
|
9.2
|
15.4
|
|
39_4d_24hrA
|
79.0
|
7.7
|
17.2
|
|
40_4d_24hrA
|
52.3
|
5.8
|
10.9
|
|
41_4d_24hrA
|
63.0
|
5.1
|
8.8
|
|
42_4d_24hrA
|
66.6
|
8.6
|
15.2
|
|
43_4d_2hrR
|
71.2
|
6.5
|
11.7
|
|
44_4d_2hrR
|
64.6
|
5.4
|
11.8
|
|
45_4d_2hrR
|
65.7
|
4.3
|
9.5
|
|
46_4d_2hrR
|
71.2
|
8.6
|
13.8
|
|
47_4d_24hrR
|
80.1
|
10.4
|
24.4
|
|
48_4d_24hrR
|
51.8
|
5.3
|
10.7
|
|
49_4d_24hrR
|
72.3
|
13.1
|
23.2
|
|
50_4d_24hrR
|
79.4
|
6.0
|
13.4
|
|
5_12d_t0
|
66.2
|
5.0
|
11.9
|
|
6_12d_t0
|
67.5
|
6.5
|
15.5
|
|
7_12d_4hrA
|
65.6
|
4.3
|
10.1
|
Alignment with mitochondria
The alignment was done on the trimmed reads from previous section.
The following stats show alignment QC.
#!/bin/bash
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/trim/*.fastq"
for f in $FILES
do
b=`basename $f`
echo Running alignment for the file $b
t=${b::-6}
o="$t.sam"
uo="$t.mapped.fq"
echo Output file is $o
bowtie \
-p 20 \
-t \
-k 10 \
--best \
--strata \
-e 99999 \
-v 0 \
-l 15 \
--chunkmbs 2048 \
-x /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/bowtie_mito_index/almitogenome \
-q $f \
--al $uo \
--sam --no-unal > $o 2> $t.bowtie.log
done
#done
This table shows the number of reads aligned in each sample (MillionAlignedReads). It also shows the number of aligments of various reads that includes multimapped reads (MillionMappedReads).
|
Sample
|
PercentAligned
|
MillionAlignedReads
|
MillionMappedReads
|
|
10_12d_4hrA
|
0.34
|
0.01
|
0.02
|
|
11_12d_4hrA
|
0.47
|
0.04
|
0.04
|
|
12_12d_4hrA
|
0.41
|
0.03
|
0.03
|
|
13_12d_24hrA
|
0.50
|
0.03
|
0.03
|
|
16_12d_24hrA
|
0.52
|
0.03
|
0.03
|
|
17_12d_24hrA
|
3.25
|
0.58
|
0.59
|
|
18_12d_24hrA
|
6.22
|
0.51
|
0.52
|
|
19_12d_2hrR
|
0.92
|
0.09
|
0.09
|
|
1_12d_t0
|
1.25
|
0.08
|
0.08
|
|
21_12d_2hrR
|
0.79
|
0.03
|
0.03
|
|
22_12d_2hrR
|
4.43
|
0.21
|
0.21
|
|
23_12d_2hrR
|
2.68
|
0.12
|
0.12
|
|
25_12d_24hrR
|
4.72
|
0.42
|
0.43
|
|
28_12d_24hrR
|
0.32
|
0.01
|
0.01
|
|
29_12d_24hrR
|
4.08
|
0.12
|
0.12
|
|
2_12d_t0
|
0.25
|
0.02
|
0.02
|
|
30_12d_24hrR
|
8.18
|
0.72
|
0.73
|
|
31_4d_t0
|
4.99
|
0.51
|
0.52
|
|
32_4d_t0
|
10.12
|
0.81
|
0.82
|
|
33_4d_t0
|
1.99
|
0.21
|
0.21
|
|
34_4d_t0
|
6.72
|
0.75
|
0.77
|
|
35_4d_4hrA
|
0.43
|
0.03
|
0.03
|
|
36_4d_4hrA
|
0.59
|
0.06
|
0.06
|
|
37_4d_4hrA
|
1.10
|
0.14
|
0.14
|
|
38_4d_4hrA
|
5.67
|
0.69
|
0.70
|
|
39_4d_24hrA
|
8.32
|
0.81
|
0.83
|
|
40_4d_24hrA
|
7.77
|
0.87
|
0.88
|
|
41_4d_24hrA
|
7.50
|
0.60
|
0.61
|
|
42_4d_24hrA
|
2.13
|
0.28
|
0.28
|
|
43_4d_2hrR
|
0.61
|
0.06
|
0.06
|
|
44_4d_2hrR
|
1.00
|
0.08
|
0.09
|
|
45_4d_2hrR
|
4.69
|
0.31
|
0.31
|
|
46_4d_2hrR
|
1.07
|
0.13
|
0.13
|
|
47_4d_24hrR
|
16.68
|
2.16
|
2.19
|
|
48_4d_24hrR
|
0.18
|
0.02
|
0.02
|
|
49_4d_24hrR
|
1.49
|
0.27
|
0.28
|
|
50_4d_24hrR
|
2.22
|
0.17
|
0.17
|
|
5_12d_t0
|
6.11
|
0.46
|
0.47
|
|
6_12d_t0
|
6.47
|
0.62
|
0.63
|
|
7_12d_4hrA
|
5.39
|
0.35
|
0.36
|
Using aligned reads
Coming Soon
miRTrace
Running mirtrace with the following command.
#!/bin/bash
mirtrace qc --species dre --config seq_address.txt
Looking at the results:

sports
Executing sports tool on mapped reads only. We use the following command to run:
#!/bin/bash
sports.pl -i seqs_alim_mito.txt \
-o /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/out_sports/REDO/mito_mapped_reads \
-k \
-M 2 \
-p 20 \
-g /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/bowtie_mito_index/almitogenome \
-m /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/miRBase_21/miRBase_21-dre \
-r /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/rRNA_db/zebrafish_rRNA \
-t /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/GtRNAdb/danRer6-tRNAs \
-w /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/piRBase/piR_dre_v1.0 \
-e /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/Ensembl/Danio_rerio.GRCz10.ncrna \
-f /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/Rfam_12.3/Rfam-12.3-zebrafish
This will generate a number of output files, from which the tables generated as a result file *_output.txt for each sample are taken and are used to extract annotation of mapped reads. Top 50 rows of this table are shown for the sample 10_12d_4hrA below.

It also generates a number of plots to categorize the annotations and their length distributions for each sample.

Compiling output files
Coming Soon
Visualizing
We use the alignments with mitochondrial genome to view the aligned reads.
Jbrowse is used to explore alignments.
Files used:
- Aligned bam files for each sample and its corresponding index file (bam, bai)
- Reference genome for mitochondria (.fasta) and its index file (.fai)
- Genome feature file (.gff) for mitochondrial genome

This figure shows mitochondiral genome of 100bp length from 17100 to 17199. On this genome, we see 1) the colored basepairs, 2) Genomic features of the mentioned length, 3) Alignments from 4 samples (4d t0 replicates) that show reads aligned to the genome colored by strand while also showing the coverage histogram in grey.
---
title: "mitosRNAseq Workflow"
date: 08-15-2024
output: 
    html_notebook:
        toc: TRUE
        toc_float: true
---


##  Summary
This workflow demonstrates the mitosRNAseq pipeline which identifies several types of
small RNA from sequencing data from short reads of <50 nt. The following method collects sequence-based 
counts, sequence locations, genomic features, and annotations. 
The data we use is from killifish embyros anoxia experiments. 

The step-by-step workflow we follow is illistrated below and a 
detailed method to perform small RNA seq on an academic HPC is decsribed in this document. 

![](/home/gazal/Documents/PSU_work/sRNA_gazal/documentation/Figure 1.png)

The following table shows the contents of the file **samples.csv** that contains metadata for our samples.
We start with a total of 40 fastq files.

### Sample Table
```{r echo=FALSE, results='asis', warning=FALSE}
library(knitr)
library(readr)
library(formattable)
library(readr)
df_samples <- read_csv("samples.csv", show_col_types = FALSE)
formattable(df_samples,
        table.attr = 'style = "width:50%; font-size:90%;"'
        )
```

## QC for raw reads

Quality Control (QC) is done by running fastQC on raw reads to evaluate how good (or bad) our data looks.

### Install software for Quality Check

**conda**

Set up a virtual environment in conda that has all of the Quality Control (QC) tools. 
We are using [Miniconda](https://docs.anaconda.com/miniconda/). 
For creating sharable environments, create .yaml files for each environment you're making.

```
conda create -n condafastqc fastqc
conda activate condafastqc
conda install -n multiqc
```

**tmux**

For almost every long-ish process, 
we use tmux to run tasks in a detached terminal window that can keep running if we want to log out or if we accidently log out.
```
tmux new -s fastqc
```

### Executing

Creating a new file called **fastqcr.R** that has the script to run fastqc through R.
We use this script to automate the process for a list of files in a directory. 
Running commands through various scripts also keeps a record of our commands.

```
if (!require("fastqcr")) {
        install.packages("fastqcr", repos="http://cran.us.r-project.org")
        library(fastqcr)
} else {library(fastqcr)}
fastqc(fq.dir="/path/to/directory/init_fastq_files",
      qc.dir="/path/to/directory/fastqc/init_fastqc",
      threads=20)
```

Running the script with the following command:
```
Rscript fastqcr.R
```
Press `Ctrl+B` and then `D` for detaching this tmux window. 
You can log out from your remote computer (HPC) and close your terminal 
window now while this process keeps running on your remote computer. 
To get back to this window, log in to your remote computer, and
 use the command `tmux a -t fastqc`. F
or more information, check [tmux](https://github.com/tmux/tmux/wiki/Getting-Started). 

We will get a fastqc HTML file in the output fastqc directory for 
each fastq sequence in the input init_fastq_files directory.

### FastQC Output

**General Statistics from multiqc**
```{r echo=FALSE, results='asis', warning=FALSE}
library(readr)
library(formattable)
df_stats <- read_tsv("multiqc_init_fastq/general_stats_table.tsv", show_col_types = FALSE)
is.num <- sapply(df_stats, is.numeric)
df_stats[is.num] <- lapply(df_stats[is.num], round, 1)
formattable(df_stats, list(
    'PercentDups' = color_bar("#E9967A"),
    'PercentGC' = color_bar("#DAA520"),
    'BPMedianReadLength' = color_bar("#66CDAA"),
    'MillionsSeqs' = color_bar("#DA70D6")),
    table.attr = "class=\'table table-condensed\', style=\'width:80%; font-size:90%;\'"
)
```

**Sequence Counts for each sample**

![Sequence Counts](multiqc_init_fastq/multiqc_plots/fastqc_sequence_counts_plot.png){width=550px}

**The mean quality scores per base in the read**

![Per Base Quality](multiqc_init_fastq/multiqc_plots/fastqc_per_base_sequence_quality_plot.png){width=550px}

**The average quality scores per sequence**

![Per Sequence Quality](multiqc_init_fastq/multiqc_plots/fastqc_per_sequence_quality_scores_plot.png){width=550px}

**The GC content per sequence**

![GC conetent](multiqc_init_fastq/multiqc_plots/fastqc_per_sequence_gc_content_plot.png){width=550px}


**Sequence Duplication**

![Duplication in each sequence](multiqc_init_fastq/multiqc_plots/fastqc_sequence_duplication_levels_plot.png){width=550px}

**Overrepresented sequences**

![Total overrepresented sequences found in each sample](multiqc_init_fastq/multiqc_plots/fastqc_overrepresented_sequences_plot.png){width=550px}

**Adapter Content**

![Adapters found at sequence positions](multiqc_init_fastq/multiqc_plots/fastqc_adapter_content_plot.png){width=550px}

**Overall Pass/Fail**

![Checking if each sample passes or fails in various meteris](multiqc_init_fastq/multiqc_plots/fastqc-status-check-heatmap.png){width=550px}

**Sequence length distribution**

All samples had sequences of a single length (50bp , 36bp , 101bp).

## Trimming

Trimming is done with Trimmomatic in our case, however, other tools like cutadapt or fastp may also be used.
This tool runs through the following bash script

```
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/copied_old_infiles/init_fastq_files/*.fastq.gz"
for f in $FILES
    do
        b=`basename $f`
        echo Running trimming for the file $b

        c=${b::-9}
        o="$c.trim.fastq"
	log="$c.out.log"
        c="$c.log.txt"
        java -jar /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/trim/Trimmomatic-0.39/trimmomatic-0.39.jar SE -threads 20 -phred33 -trimlog $c $f $o ILLUMINACLIP:/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/copied_old_infiles/adapter_contamination_sequences_AR3.txt:2:30:5:1:true SLIDINGWINDOW:5:15 LEADING:20 TRAILING:20 MINLEN:15 2> $log
    done
```

![Reads Surviving](trimmomatic/trimmomatic_plot.png){width=550px}

```{r echo=FALSE, results='asis', warning=FALSE}
library(readr)
library(formattable)
df_stats <- read_tsv("trimmomatic/trimmomatic_plot.tsv", show_col_types = FALSE)
is.num <- sapply(df_stats, is.numeric)
df_stats[is.num] <- lapply(df_stats[is.num], round, 1)
formattable(df_stats, list(
    'TotalReads' = color_bar("#BDB76B"),
    'SurvivingReads' = color_bar("#90EE90"),
    'PercentSurviving' = color_bar("#32CD32"),
    'DroppedReads' = color_bar("#F08080"),
    'PercentDropped' = color_bar("#CD5C5C")),
    table.attr = "class=\'table table-condensed\', style=\'width:100%; font-size:90%;\'"
)
```
				

## QC for trimmed reads

Quality Control (QC) is done for the trimmed reads with fastQC. 
We activate the conda environment `condafastqc` for this process.

```
if (!require("fastqcr")) {
        install.packages("fastqcr", repos="http://cran.us.r-project.org")
        library(fastqcr)
} else {library(fastqcr)}
fastqc(fq.dir="/path/to/directory/trimmed_fastq_files",
      qc.dir="/path/to/directory/fastqc/trim_fastqc",
      threads=20)
```

Running the script with the following command:
```
Rscript fastqcr.R
```

### FastQC Output

Plots for each FASTQC metric after trimming is shown below. 
All the adapters were removed and overall quality looks better.

**General Statistics from multiqc**
```{r echo=FALSE, results='asis', warning=FALSE}
library(readr)
library(formattable)
df_stats <- read_tsv("multiqc_trim_fastq/general_stats_table.tsv", show_col_types = FALSE)
is.num <- sapply(df_stats, is.numeric)
df_stats[is.num] <- lapply(df_stats[is.num], round, 1)
formattable(df_stats, list(
    'PercentDups' = color_bar("#E9967A"),
    'PercentGC' = color_bar("#DAA520"),
    'BPMedianReadLength' = color_bar("#66CDAA"),
    'MillionsSeqs' = color_bar("#DA70D6")),
    table.attr = "class=\'table table-condensed\', style=\'width:80%; font-size:90%;\'"
)
```
**Sequence Counts for each sample**

![Sequence Counts](multiqc_trim_fastq/multiqc_plots/fastqc_sequence_counts_plot.png){width=550px}

**The mean quality scores per base in the read**

![Per Base Quality](multiqc_trim_fastq/multiqc_plots/fastqc_per_base_sequence_quality_plot.png){width=550px}

**The average quality scores per sequence**

![Per Sequence Quality](multiqc_trim_fastq/multiqc_plots/fastqc_per_sequence_quality_scores_plot.png){width=550px}

**The GC content per sequence**

![GC conetent](multiqc_trim_fastq/multiqc_plots/fastqc_per_sequence_gc_content_plot.png){width=550px}


**Sequence Duplication**

![Duplication in each sequence](multiqc_trim_fastq/multiqc_plots/fastqc_sequence_duplication_levels_plot.png){width=550px}

**Overrepresented sequences**

![Total overrepresented sequences found in each sample](multiqc_trim_fastq/multiqc_plots/fastqc_overrepresented_sequences_plot.png){width=550px}

**Overall Pass/Fail**

![Checking if each sample passes or fails in various meteris](multiqc_trim_fastq/multiqc_plots/fastqc-status-check-heatmap.png){width=550px}

**Sequence length distribution**

![length distribution of sequences](multiqc_trim_fastq/multiqc_plots/fastqc_sequence_length_distribution_plot.png){width=550px}

## Alignment

The trimmed reads are aligned to the reference genome using bowtie.
To analyze reads mapping with mitochondrial genome, we perform the alignment in two steps.

1. Complete Genome (with mitochondrial genome)- with algenome.fa
2. Only the Mitochondrial genome which is extracted from genome.fa- almitogenome.fa

To begin alignment, we make bowtie indexes, this is done in two sets.

```
bowtie-index algenome.fa
bowtie-index almitogenome.fa
```

### Alignment with genome

The alignment was done on the trimmed reads from previous section.

```
#!/bin/bash
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/trim/*.fastq"
for f in $FILES
    do
        b=`basename $f`
        echo Running trimming for the file $b
        t=${b::-6}
        o="$t.sam"
        uo="$t.mapped.fq"
        echo Output file is $o
        bowtie \
	 -p 20 \
     	 -t \
	 -k 5 \
	 --best \
	 --strata \
	 -e 99999 \
	 -v 0 \
   	 -l 15 \
  	 --chunkmbs 2048 \
	 -x /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/bowtie_index_alim/algenome \
	 -q $f \
	 --al $uo \
	 --sam --no-unal > $o 2> $t.bowtie.log
    done
#done
```

The following stats show alignment QC. 

![Bowtie Alignment Stats with Genome](alignment/genome/bowtie1_alignment.png){width=550px}

This table shows the number of reads aligned in each sample (MillionAlignedReads). 
It also shows the number of aligments of various reads that includes multimapped reads (MillionMappedReads).

```{r echo=FALSE, results='asis', warning=FALSE}
library(readr)
library(formattable)
df_stats <- read_tsv("alignment/genome/general_stats_table.tsv", show_col_types = FALSE)
is.num <- sapply(df_stats, is.numeric)
df_stats[is.num] <- lapply(df_stats[is.num], round, 1)
formattable(df_stats, list(	
    'PercentAligned' = color_bar("#B0C4DE"),
    'MillionAlignedReads' = color_bar("#E6E6FA"),
    'MillionMappedReads' = color_bar("#778899"),
    table.attr = "class=\'table table-condensed\', style=\'width:80%; font-size:90%;\'")
)
```
### Alignment with mitochondria

The alignment was done on the trimmed reads from previous section.

The following stats show alignment QC. 

```
#!/bin/bash
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/trim/*.fastq"	
for f in $FILES
    do
        b=`basename $f`
        echo Running alignment for the file $b
        t=${b::-6}
        o="$t.sam"
	uo="$t.mapped.fq"
        echo Output file is $o
        bowtie \
	 -p 20 \
     	 -t \
	 -k 10 \
	 --best \
	 --strata \
	 -e 99999 \
	 -v 0 \
   	 -l 15 \
  	 --chunkmbs 2048 \
	 -x /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/bowtie_mito_index/almitogenome \
	 -q $f \
	 --al $uo \
	 --sam --no-unal > $o 2> $t.bowtie.log
    done
#done
```

![Bowtie Alignment Stats with Mitochondrial Genome](alignment/mitochondria/bowtie1_alignment.png){width=550px}

This table shows the number of reads aligned in each sample (MillionAlignedReads). 
It also shows the number of aligments of various reads that includes multimapped reads (MillionMappedReads).

```{r echo=FALSE, results='asis', warning=FALSE}
library(readr)
library(formattable)
df_stats <- read_tsv("alignment/mitochondria/general_stats_table.tsv", show_col_types = FALSE)
is.num <- sapply(df_stats, is.numeric)
df_stats[is.num] <- lapply(df_stats[is.num], round, 2)
formattable(df_stats, list(	
    'PercentAligned' = color_bar("#B0C4DE"),
    'MillionAlignedReads' = color_bar("#E6E6FA"),
    'MillionMappedReads' = color_bar("#778899"),
    table.attr = "class=\'table table-condensed\', style=\'width:80%; font-size:90%;\'")
)
```

## Using aligned reads
**Coming Soon**

### miRTrace
 Running mirtrace with the following command.
```
#!/bin/bash
mirtrace qc --species dre --config seq_address.txt
```

Looking at the results:

![](mirtrace/mirtrace-length-plot.png){width=100%}
![](mirtrace/mirtrace-rnatype-plot.png){width=100%}

### sports

Executing sports tool on mapped reads only. We use the following command to run:

```
#!/bin/bash
sports.pl -i seqs_alim_mito.txt \
-o /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/out_sports/REDO/mito_mapped_reads \
-k \
-M 2 \
-p 20 \
-g /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/bowtie_mito_index/almitogenome \
-m /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/miRBase_21/miRBase_21-dre \
-r /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/rRNA_db/zebrafish_rRNA \
-t /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/GtRNAdb/danRer6-tRNAs \
-w /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/piRBase/piR_dre_v1.0 \
-e /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/Ensembl/Danio_rerio.GRCz10.ncrna \
-f /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/Rfam_12.3/Rfam-12.3-zebrafish
```

This will generate a number of output files, from which the tables
generated as a result file *_output.txt for each sample are taken and 
are used to extract annotation of mapped reads.
Top 50 rows of this table are shown for the sample 10_12d_4hrA below. 


![](sports/Screenshotfrom10_12d_4hrA.png){width=550px}

It also generates a number of plots to categorize the annotations 
and their length distributions for each sample.

![](sports/Screenshot10_12d_4hrA.png){width=550px}
![](sports/Screenshot10_12d_4hrA_graph.png){width=550px}

## Compiling output files
**Coming Soon**

## Visualizing
We use the alignments with mitochondrial genome to view the aligned reads.

Jbrowse is used to explore alignments.

Files used:

1. Aligned bam files for each sample and its corresponding index file (bam, bai)
3. Reference genome for mitochondria (.fasta) and its index file (.fai)
4. Genome feature file (.gff) for mitochondrial genome

![](jbrowse/jbrowse_4d_t0.svg){width=100%}

This figure shows mitochondiral genome of 100bp length from 17100 to 17199. On this genome, we see 
1) the colored basepairs, 2) Genomic features of the mentioned length, 3) Alignments from 4 samples (4d t0 replicates)
that show reads aligned to the genome colored by strand while also showing the coverage histogram in grey. 